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Abstract 

Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within bio- 
logical pathways, the incorporation of prior pathways information into a statistical model is 
expected to increase the power to detect true associations in a genetic association study. Most 
existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the 
dependence patterns among SNPs within pathways. We use a sparse regression model, with 
SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. 
Notable features of our pathways group lasso with adaptive weights (P-GLAW) algorithm in- 
clude the incorporation of all pathways in a single regression model, an adaptive pathway 
weighting procedure that accounts for factors biasing pathway selection, and the use of a 
bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes ac- 
count of the presence of overlapping pathways and uses a novel combination of techniques to 
optimise model estimation, making it fast to run, even on whole genome datasets. In a com- 
parison study with an alternative pathways method based on univariate SNP statistics, our 
method demonstrates high sensitivity and specificity for the detection of important pathways, 
showing the greatest relative gains in performance where marginal SNP effect sizes are small. 



1 Introduction 



The mixed success of attempts to identify genetic variants that account for a large part of the 
hcritability of common disease has focussed attention on the need to develop new methodological 
approaches to the analysis of GWAS data. A number of factors that might explain this 'missing 
hcritability' have been suggested, including the failure of many current models to capture the 
presence of gcnc-gcnc and gene-environment interactions, of multiple SNPs with small effect and 
of rare variants (Manolio et al. 2009 Goldstein 2009). One promising approach uses prior in- 



formation on functional structure present within the genome to group genes and associated SNPs 
into gene sets or pathways. The motivation here is that genes do not work in isolation, but instead 
work together through their effect on molecular networks and cellular pathways. The hope is that 
by jointly considering the effects of multiple SNPs or genes within a biological pathway, significant 
associations might be identified that would otherwise be missed when considering markers indi- 



vidually (Wang et al. 2010). First developed in the context of gene expression studies (Mootha 



et al. |2003 ), pathways-based methods have more recently been extended to the analysis of GWAS 



data (Holmans et al. 2009 



2010 


Lango Allen et al. 


2010 


Lambert et al. 


2010 



has led to the identification of putative causal pathways for a number of diseases including Parkin- 



son's Disease ( Lesnick et al. 2007 1 , Crohn's Disease ( Wang et al. 2009b ) and rheumatoid arthritis 
( Eleftherohorinou et al. 2011). As well as offering the potential for increased statistical power, 
pathways-based genetic association studies (PGAS) can aid the biological interpretation of results 
through the identification of causal pathways, and may also facilitate comparisons between studies 



gcnotyping different variants that nonetheless map to common pathways (Ma and Kosorok 2010 



Cantor et al. 2010) 



The majority of existing PGAS methods begin with a univariate test of association, in which 
individual SNPs are scored according to their degree of association with disease status or a quan- 
titative trait. Various techniques are then used to combine these univariate statistics into pathway 
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scores. For example, the GenGen method (Wang et al. 2007) first ranks all genes according to 
the value of the highest-scoring SNP within 500kb of each gene. Pathway significance is then 
assessed by determining the degree to which high-ranking genes are over-represented in a given 
gene set, in comparison with the genomic background. The PLINK tookit (Purcell et al. 20071 



also features a 'set-based test', in which pathway significance is measured by taking the average, 
marginal p-value of a pre-determined maximum number of 'uncorrelated' SNPs within the path- 
way. Here, uncorrelated SNPs arc defined as those whose pairwisc linkage disequilibrium (LD) is 
below a certain threshold value. As a final step, where more than one pathway is considered a 
correction for multiple testing is generally made. 

In contrast to univariate, 'one SNP at a time' methods, multivariate or multi-locus methods 
allow all SNPs to be considered in the model at the same time, which can aid the identification of 
weak signals while diminishing the importance of false ones. One such approach consists of fitting 
a penalised, multivariate regression model, in which a subset of SNPs is selected by imposing 
a penalty on some suitably selected norm of the regression coefficients, as in Lasso regression 
(Tibshirani 1996). This approach has been shown to yield higher statistical power, compared to 



more common 'mass univariate linear models', especially with multivariate and high-dimensional 
quantitative traits (Vounou et al. 2010). Several other studies have demonstrated the advantages 
of this approach for the detection of genetic associations. For example, Wu et al. (20091 use 
penalized logistic regression to select SNPs in a case-control study, and analyse two-way and 
higher-order SNP-SNP interactions. Hoggart et al. (20081 propose a similar method for SNP 



selection in a Bayesian context. 

A number of penalized regression techniques that allow prior information on the relationship 
between SNP markers to be incorporated into the model selection process have recently been 
proposed. For example, Zhou et aT| ( 2010) group SNPs into genes, and utilise a useful property 
of the group lasso ( Yuan and Lin| 2006) to aid the detection of rare variants within genes. The 
GRASS method (Chen et al., 2010) begins by characterising within-gene variation as 'eigenSNPs', 
obtained by principal component analysis (PCA). A combination of lasso and ridge regression, 
followed by permutations is then used to measure significance for a single pathway. Finally Zhao 



et al. (2011) use a combination of PCA and lasso regression to identify a subset of genes within 



a candidate pathway, followed by permutations to measure pathway significance. Once again this 
method considers one pathway at a time. 

The search for SNPs, or quantitative trait loci (QTL) influencing quantitative traits is gaining 
momentum as a potentially more powerful way to study the underlying causes of complex disease 
(Plomin et al. 2009). In the emerging field of neuroimaging genetics for example, in which we 



have a particular interest, quantitative data in the form of MRI or PET scans serve as a type of 
intermediate phenotype in the study of complex disorders such as Alzheimer's Disease (AD) or 
schizophrenia (Bigos and Weinberger 2010). We use genotype data from the Alzheimers Disease 
Neuroimaging Initiative (ADNI) dataset in this analysis. 

Our focus here is on the identification of biological pathways associated with a quantitative 
trait. Our assumption is that where causal SNPs are enriched in a pathway, the use of a regression 
model that selects SNPs that are grouped into pathways will have increased power, compared to 
a more traditional approach in which SNPs are considered one at a time. We also seek a true, 
multivariate model which includes all mapped pathways at the same time. The hope is that this 
will confer some of the benefits, in terms of detecting weaker signals and diminishing false positives, 
described earlier. To achieve these ends, we use a modified version of the group lasso (GL) with 
SNPs grouped into pathways, and develop a fast estimation algorithm applicable to the case of 
non-orthogonal groups. In order to rank pathways, we use a bootstrap sampling procedure to rank 
pathways in decreasing order of importance. We face a number of challenges in applying GL to 
SNP and pathway data for the identification of implicated pathways. These include the fact that 
pathways overlap, since many SNPs map to multiple pathways; the problem of selection bias, that 
is the tendency of the model to select pathways having specific statistical properties irrespective of 
their association with phenotype; and the sheer scale of SNP datasets, making efficient estimation 
a necessity. 

We have found that the issue of overlapping pathways receives surprisingly little attention in 



the PGAS literature, given that the presence of overlaps might be expected to have a significant 
impact on the results of any PGAS analysis. For example, variation in the number and distribution 
of causal SNPs with respect to genes that overlap multiple pathways will affect the number of 
pathways defined to be 'causal', and different PGAS methods will be affected by such variation in 
different ways. Additionally, the inclusion of multiple pathways in a single GL regression model 
presents a particular problem, since GL in its original formulation will not select pathways in 
the manner that we would wish. To account for this we employ a variable expansion procedure, 



originally proposed in the context of microarray data analysis by Jacob et al. ( 2009 ) , that ensures 



that overlapping SNPs enter the regression model separately, for each pathway that they map to. 

A number of factors may bias PGAS results, exaggerating pathway significance and giving 
rise to inflated numbers of false positives. Depending on the methods used, and the underlying 
disease-causing mechanism, such factors are likely to include pathway size (measured in number of 
SNPs and/or genes), and the extent and distribution of pathway LD. Common strategies employed 
by existing methods to reduce this bias include the use of permutation (of genes or phenotypes), 



and dimensionality reduction techniques such as PCA (Fridley and Biernackal 2011 Wang et al 



2010). We propose a procedure that reduces bias by adjusting pathway weightings in the regression 
model according to the empirical bias in pathway selection frequencies obtained by fitting the GL 
model with a null response. 

One potential drawback of using a regression model in the analysis of genetic data is the 
typically very large number of predictors (here SNPs) that must be analysed. While the use of 
penalized regression techniques at least makes the problem tractable when the number of predic- 
tors vastly exceeds sample size, the very large matrix calculations required can still make model 
estimation computationally infeasible. To address this, we combine a number of techniques that 
speed up the estimation process including the use of an 'active set' of predictors, a Taylor ap- 
proximation of the GL penalty and efficient computation of pathway block residuals. The final 
estimation algorithm, which we call 'Pathways Group Lasso with Adaptive Weights' (P-GLAW), 
is sufficiently fast to obviate the need either to undertake a preliminary stage of dimensionality 
reduction, or to consider pathways individually. 

We evaluate our method's performance in a Monte Carlo (MC) simulation study, using real 
genetic and pathway data with quantitative phenotypes simulated under an additive genetic model. 
We consider a range of scenarios with different causal SNP distributions and effect sizes. We feel 
the use of real genotype and pathway data is crucial, so as to capture the complex distributions 
of gene size and number within a pathway, together with SNP LD patterns and overlaps between 
pathways, all of which may have a significant effect on pathway ranking performance. To our 
knowledge, this is the first such PGAS power study using GL with real SNP and pathway data. 
The evaluation of GL pathway ranking performance however presents a number of challenges. 
Firstly, as described above, variation in the number of causal pathways due to overlaps must be 
taken into account when evaluating performance over multiple MC simulations. Secondly, we 
require a means of evaluating the degree to which causal pathways are represented amongst the 
top ranks. Thirdly, since GL performs variable selection, not all causal pathways may be ranked, 
and ranking performance measures must reflect this. To address these issues we devise a battery 
of measures that aim to capture different aspects of ranking performance. Finally, we compare 
our method's performance with another common PGAS method, derived from univariate SNP 
statistics. 

The article is organised as follows. Section [2] describes the GL model; our strategy for dealing 
with overlapping pathways, model estimation and speed-ups; our proposed bias-adjusted pathway 
weighting update procedure; our strategy for ranking pathways using a resampling procedure, and 
our proposed ranking performance measures. In Scction[3]we describe the real biological data sets 
used in the experiments, the SNP to pathway mapping process, and the simulation framework 
used to evaluate both methods under consideration. The results from these simulation studies are 
provided in Section |4| and we conclude in Section [5] with a discussion and final remarks. 



2 Methods 



2.1 The group lasso for pathway selection 

We assume N unrelated individuals genotyped at P SNPs, each with a univariate quantitative 
trait yi, for i = 1, . . . , N. For an individual i, we denote by the minor allele count for SNP j, for 
j = 1, . . . , P, and arrange these counts in an (N x P) design matrix X. Quantitative phenotypes 
are arranged in an (N x 1) column vector y, and will be treated as quantitative responses in a 
regression model. 

We initially consider the situation where SNPs are partitioned into L mutually exclusive path- 
ways, or groups. Each group Gi, for I = 1, ... , L, is a subset of {1,2,..., P} of cardinality Si, 
containing the indices l\, I2, ■ ■ ■ , ls t of the SNPs that belong to it, such that Gi HGv = for any 
I ^ I'. We denote by = {1, ... , P}, the set of all SNP indices. We denote by S C {1, . . . , P} 
the subset of SNPs that are causal, that is the SNPs influencing y, and additionally denote the 
cardinality of S by 5*. Accordingly, we denote by C C {1, 2, . . . , L} the subset of causal pathways 
containing one or more SNPs in S, having cardinality \C\. We denote the complement of C by C. 
We also assume that \C\ CL, so that only a small proportion of all pathways are causal. Finally, 
we assume that y can be optimally predicted, in the least squares sense, by a linear combination 
of allele counts corresponding to SNPs in pathway Gi, where / belongs to the set C. 

We denote the vector of SNP regression coefficients (3 = (j3i, . . . , ftp) £ ]R P , and the parameter 
vector corresponding to SNPs in pathway Gi only as /3 ; = (/^ , . . . , /3; s ) £ R s ' . Under these 
assumptions, one or more elements of each f3i for I £ C are expected to be non-zero, whereas all 
the regression coefficients associated with SNPs that do not belong to C will be zero, that is fli = 
for I £ C . For example, for a single causal pathway Gi with causal SNPs {a, b} in S, the sparsity 
pattern might look like 

/9 = {((V^o), . . . , (o,...,a o ,o ft t ,o,...,o) , . . . , (o 1 __o)}. 

Si Gi Gl 

A suitable regression model that would enforce the assumed block sparsity pattern above is 



the group lasso (GL) (Yuan and Lin 2006), in which estimates for /3 are obtained by minimising 



the penalised least squares function 



(1) f((3) = h\y-X(3\\l + \jr wi \\(3 l \\ 2 

1 1=1 

with respect to /3, where || • || 2 denotes the £2 (Euclidean) norm and wi is a pathway weighting 
factor for group I. Sparsity at the pathway level is encouraged through the imposition of an l\ 
lasso penalty on ||/3/||2, which ensures that SNPs belonging to pathways not selected by the model 
have zero regression coefficients. For selected pathways, i.e. those with /3; ^ 0, SNP coefficients 
tend to shrink, through the imposition of a ridge-type penalty on ||/3;||2- The degree of sparsity is 
controlled by the regularisation parameter, A, such that the number of pathways selected by the 
model increases with decreasing A. For a given A, the block sparsity pattern is determined both 
by the data (y and X), and by the distribution of pathway weights, w — (wi, . . . ,wi), such that 
an increase in wi means that pathway I is less likely to be selected, whereas a decrease in wi will 
have the opposite effect. 

The GL optimisation problem associated with minimising the objective function ([T]) is convex, 
and can be solved using coordinate descent methods. Problems arise however in the situation 
where pathways overlap, that is when a SNP is allowed to belong to more than one pathway, so 
that Gi n Gi' ^ for some I ^ I'. Firstly, where groups overlap, the penalty term in ([!]) is no 
longer separable into groups, since the same SNPs occur in multiple pathways, and convergence 



using coordinate descent is no longer guaranteed (Tseng and Yun 2009). Secondly, if we wish 
to be able to select pathways independently, GL is unable to do this. We illustrate this last 
point using a simple example in Fig. [T] A, where we consider only three pathways, Gi, Gi and 



(?3, two of which overlap. As a consequence of this, pathway parameter vectors f3± and (3% also 
overlap, since they have a number of SNPs in common (shaded dark grey). If a shared SNP is 
selected (i.e. it has a non-zero coefficient), then both pathways to which it belongs (Gi and G2) 
are also selected, since their corresponding pathway parameter vectors have non-zero £2 norms. 
The GL regression model thus does not meet our requirements, since in order to be able to rank 
pathways in order of importance, we wish to be able to distinguish overlapping pathways and 
select them independently. Conversely, where shared SNPs have zero coefficients, for example in 
the case that Gi is not selected in the model, then these SNPs will have zero coefficients in each 
and every pathway to which they belong (here Gi and C^)- Hence SNPs retained in the model are 
necessarily drawn from the complement of a union of (unselected) pathways. We instead require 
retained SNPs to be drawn from a union of (selected) pathways, so that a SNP driving selection 
in one pathway may still have a zero coefficient in another. 



l! ni 
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Figure 1: The problem of overlapping pathways: here there are three pathways, Gi,Q2 and <?3, two of 
which overlap. A: Standard formulation. Pathway parameter vectors f3i and /32 overlap, since they have 
SNPs in common (shaded dark grey). Where an overlapping SNP has a non-zero coefficient, only C/3, can 
be selected independently. B: Formulation with duplicated SNPs. An expanded parameter vector, /3* , 
is created by duplicating overlapping SNPs (dotted line). /3i and /3J now enter the model separately, so 
that pathways can be selected independently. 



Jacob et al. ( 2009 ) propose one possible solution to the problem of overlapping predictors in a 



similar context, motivated by the analysis of gene expression data. The essence of this method is to 
create duplicate, dummy SNPs, so that SNPs belonging to more than one pathway enter the model 
separately (see Fig.[l]B). The process works as follows. An expanded design matrix is formed from 
the column- wise concatenation of the L sub-matrices of size (N x Si), that is Xg ; = {syj with 
i = 1, . . . , N and j £ Gi, to form the expanded design matrix X* = [X.g 1 , Xg 2 , . . . , ~Kg L ] of size 
(N x P*), where P* — J2i The corresponding parameter vector, (3* , size (P* x 1), is formed 
by joining the L, (Si x 1) pathway parameter vectors, f3j, so that (3* = [(3l T ', fi*^ \ . . . , /3* L T ] T . The 
model is then able to perform pathway selection in the way that we require, and the optimisation 
([T]), with (3 replaced by (3*, and X replaced by X* becomes block separable, so that it can be 
solved by block coordinate descent. In the following sections we assume both (3 and X have been 
expanded, but omit the * superscript for clarity. Finally, we note that where one or more SNPs 
in S overlap multiple pathways, the corresponding number, \C\, of causal pathways will increase. 



2.2 Parameter estimation 



We seek a solution, /3, that minimises the GL objective function (JlJ. Where groups or pathways 
are disjoint, so that the penalties are separable into groups, a global solution can be obtained using 
block coordinate descent (BCD). Coordinate descent algorithms offer a highly efficient means of 
solving convex optimisation problems, and work by breaking down the optimisation into a series of 
univariate problems, solving the optimisation for each variable (here SNP) in turn, while holding all 
the others fixed, until a suitable minimum based on some stopping criterion is reached ( |Friedman| 



et al. 2007). Where variables are grouped, as in GL, estimates are obtained for each pathway 
parameter vector, 0i in turn, while holding constant the current estimates for all other pathway 
parameter vectors, /3 m , (m ^ I), and then cycling through each pathway until convergence. 



Yuan and Lin (2006) derive a method for solving GL under the assumption that the group 
design matrices, Xg ( are orthogonal, that is Xg, = I. This assumption does not hold in our 
case, so in the next section we derive a solution for GL in the case of non-orthogonal groups. We 
additionally find that GL estimation using BCD can be slow, particularly for the large datasets 
common to PGAS, and so in the following sections propose a number of strategies for speeding 
up parameter estimation. 

2.2.1 Block coordinate descent for non-orthogonal groups 

We assume that ([I]) is block-separable, that is the groups indexed by 1,...,L are disjoint by 
construction. In our context, this is achieved by implementing the SNP duplication strategy of 
section [2~Tj We begin by considering a single pathway I. We collect the N individual observed 
SNPs for a given SNP j in a column vector Xj = (xij,X2j, ■ ■ ■ , xjvj)- Using this notation, we 
define the matrix Xg ( = (Xi 11 Xi 2 , . . . ,Xs t ) containing all Si SNPs belonging to pathway Gi, and 
the corresponding vector of regression coefficients (3i = (fi^ , /?/ 2 , . . . , f3s t )- We can then rewrite the 
objective function ([I]) for a single block I as a function of f3i, 

(2) /(A) = ^||fi-E^||+A^||A||2 

where f \ = y — J^m^i ^-m0m- The vector f/ is the 'partial residual' vector for pathway I, based 
on the current estimates, (3 mi m =/= I, of the other pathway parameter vectors. 

Estimates for each j5j are then obtained by taking partial derivatives with respect to /3j, that 
is by setting 

3) = for j = h,..., Si 

Ignoring the penalty term, the partial derivative with respect to f3j is 

3 j j 

We denote the partial derivative of the penalty term, by 

so that ([3]) can be written as 

(4) -Xf(r l -J2 x ^j) + ^w l s j =0 j = h,...,Si 

3 

We first consider the case where (3i = 0, that is fij — 0, for j — lx, . . . , Si. In this case \\P1W2 
is not differentiable. We instead form the Si sub-differentials, Sj € [—1,1], so that 

(5) E^ 1 



The system of equations Q can now be written 

1 



and using ([5]), we have 
(6) 



j A 2 



Note that for ( |6"|) to be unbiased with respect to group size, a weight, Wi 
Yuan and Lin (2006), can be applied. Alternatively, since 



'Si, as proposed by 



we may rewrite ^ as 



so that if f3i = 



E(^) 2 = 11x^11 

(E4)-^iix^ii 2 <i, 



(7) 



When [3i 7^ 0, the minimisation of ([2| can be obtained numerically, using coordinate descent 



as a series of one-dimensional estimations over j3j,j = li,...,lg r Friedman et al. (2010) suggest 



a golden section search over 0j, combined with parabolic interpolation. However, the number of 
such estimations depends on L and P* , both of which increase with P, the latter markedly so. 
This can make the GL optimisation prohibitively slow, particularly for the large P typically found 
in PGAS. For this reason, we next describe three strategies for speeding up the estimation. 

2.2.2 Taylor approximation of penalty 

One means of speeding up the estimation for j3j is to use a linear or quadratic approximation of the 



GL £2 penalty (Zou and Li 2008 Fan and Li, 2001), enabling the replacement of the multi-step 



numerical optimisation over f3j with a one-step calculation. Breheny and Huang ( 2009 ) propose the 



use of a Taylor approximation for a range of different estimation problems with grouped variables 
and we adopt this approach for our GL estimation problem. We begin by rewriting the group Qi 
objective function ([2]), for a single predictor as 

f(Pl\Pk, k e Qu k ^ j) = ~||r, - ]T Xj k - XjftWl + XwiT(pi\p k ) 

k 

where T(Pi\(3k) = (c + fl]) 3 , with c = X)fe=y^fci an d the j3k are the current SNP coefficient 
estimates. For convenience, we rewrite this as 



(8) 



1 



f(Pi\Pk, k^j) = -\\v + Xrfj - X&W* + \ Wl T{Pi\p k ) 



where f = y — J2i X//3/ is the total residual, using the current estimates of all SNP coefficients. 
We now consider the first order Taylor expansion of T(Pi\f3k) as a function of x = about the 
point a = /3 2 



r(a;)~r(o)+r'(o)(x-a) 



Now 



r(ar) = (c+xy 
1 



and r'(a) = 



2(c + a)5 



so that 



T(x) ~ (c + a)5 + 



2(c + a)5 



Substituting a = ft, and noting that (c + a) 2 = ||/3/||2, where 0i denotes the current estimate of 
Pi, this gives 



res?) ~ Pi 



PI-PI 
2IIAII2 



Substituting this expression in (18]), we have 



fc ^ j) = -||r + Xjfo ~ X^Wl + \ Wl P 



Differentiating with respect to /?,- gives 



ft -ft 



2||/3, 



(2 



a/(A) 

a& 



-Xfp + Xjfc-XjPA + XwtJ^- 

IIAH2 



-Xjr- 



Pi + pj + \ Wl 



HA 



XjXj = 1. Rearranging terms and setting the partial derivative equal to zero, we 
sec that the minimum is achieved when 



since ^ x\ 3 



(9) 



xji + 



where A' = 



Xwi 

\W 



l\\2 



Where the current estimate HAH2 = 0, that is when group I first enters the estimation, we set 
HAII2 to be a small positive quantity, rj, enabling Pj in ^ to be estimated. 

BCD proceeds by obtaining estimates for each Pj,j = lx,...,Si,l,...,Si,... until convergence 
within the block, and for each pathway, I = 1, . . . , L, 1, in turn, until a stopping criterion 

indicating a global minimum of ([T]) has been satisfied. The estimation process is summarised in 
Box[U 



2.2.3 Use of pathway 'active set' 

For large P* and L, the need for the repeated calculation of Q to establish whether or not 
a particular group can enter the estimation presents a major computational bottleneck. This 
problem motivates another strategy providing substantial gains in computational efficiency for 
a range of sparse regression problems. This 'active set' strategy relies on the pre-selection of a 
subset of 'potentially active' predictors, or groups of predictors that are likely to be selected by 
the model at a given A (Tibshirani et al. 2010 Roth and Fischer 2008). The optimisation can 



then be run over this reduced set of variables, with a subsequent check to ensure that no other 
predictors should have been included in the first place. The active set procedure offers potentially 
dramatic speed up in execution times, particularly for very large datasets such as those found in 
PGAS, due to the reduced number of computations that need to be performed. In addition there 
are substantial savings in the amount of memory required to store data during processing, which 



Box 1 GL estimation algorithm using BCD 



1. set = 0. 

2. For pathway Qi, I = li, 2, . . . , L: 

set r t = y - J2 m ^i x ™An 
li\\X[vi\\ 2 <Xwt 
set $i = 

else 

do 

for j = h,...,Si 

estimate 0j using ^ 

end 

until convergence of f(Pi) Q 
set 0i = ft 

end 

end 

3. Repeat step 2 until (global) convergence of 



can also lead to dramatic reductions in computation times with large datasets where memory is 
constrained. 

For the GL, we begin by considering the inequality Q. For groups to enter the model we 
require 



(10) 



\Xfh\\2>Xwi l = l,...,L 



and therefore, at the first iteration, with /3 initialised to zero, a group Qi enters the model if 
(11) ||Xfy|| 2 >A W , I = !,...,£. 



We define the 'active set' A of potentially active groups that satisfy ( 11 ) as 

A = {meG: ||X£y|| 2 > Xw m } 



and additionally define 
(12) 



A 



max — mm 
A 



|Xfy|| 2 < Xwi 1 = 1,. 



,L 



namely the smallest A value for which the active set is empty. Note that provided A is close to 
Amaa;, then \A\ <C L. Once one or more groups enter the model, not all 0\ will be zero and the 
inequality ( 10 ) will then determine which groups may enter or leave the model. 



The active set procedure rests on the observation that in practice, the final set of groups 
selected by the model rarely includes any groups not in A (Tibshirani et al. 2010). We can 
therefore perform the full estimation on A, followed by a check of the inequality (10), to see if 



any additional groups not in A can enter the model. If there arc no additional groups, then we 
have the full solution. If not, then we run the full estimation again, with the additional groups 
satisfying ( 10 ) added to A. A summary of the active set algorithm is given in Box [2] 



2.2.4 Efficient computation of block residuals 

A further, large computational burden results from the repeated calculation of the residuals r; 
and r in Q, ^ and (10). The computational overhead for these calculations is substantial, 
both because of the size of the expanded design matrix (N = 743 and P* = 66, 085 in the 



Box 2 Active set algorithm for a single A value 



1. Form the active set, A — {m € Q : ||X^y||2 > Aw m } 

2. Set f3 = 0, and solve the GL estimation at A, using only the groups in A: 

/3 = mindly - ^ X m /3 m ||| + A ^ w m ||/3 ro || 2 

3. Compute the revised active set on the full dataset: 

A+ = {zeG :||Xjr 2 || 2 > Xw z } 

if A+/A = % 

(3 is the full solution 

STOP 
else 

set A = A + 

repeat 2. and 3. with the new, (expanded) active set 
end 



simulation study described in section [3] but substantially larger for a full PGWAS), and because 
of the iterative nature of the BCD algorithm, meaning that a very large number of calculations are 
performed. We therefore achieve one further substantial gain in computational efficiency by noting 
that since the blocks are separable, during BCD only the single block residual, h; = y X;/3;, 
changes between iterations j = 1, . . . , Si, 1, . . . , Si, . . . within block I, and between iterations I = 
1, . . . , L, 1, . . . , L, . . . across blocks. We therefore only need update h; at each iteration, with r and 
r/ updated using computationally inexpensive matrix subtractions and additions. Python code for 
mapping SNPs to pathways, and for analysing SNP data using PGLAW is available on request. 



2.3 Selection bias and pathway weighting 

PGAS methods derived from univariate SNP statistics are subject to various biasing factors that 
can influence pathway ranking under the null, where no SNPs influence the phenotypic trait, y. 
These factors vary from method to method, but may include the number and size of genes in a 
pathway, as well as LD between SNPs and genes. Such biasing factors are generally corrected 



through the use of permutation procedures. For example, the 'GenGen' method (Wang et al 



2009b), measures the degree to which pathways are enriched with high ranking genes, and is sub- 
ject to bias due to variation in the number of SNPs mapped to a gene, and to differences in LD 
between SNPs mapped to different genes. The bias correction procedure begins by forming multi- 
ple datasets through permutation of phenotype labels. For each permuted dataset, gene scores are 
generated from univariate SNP statistics, and a pathway enrichment score is calculated. A nor- 
malised (bias-corrected) pathway enrichment score is then derived by comparing the distribution 
of pathway scores under the null with the score obtained from the unpermuted data. 

Regression-based methods are similarly prone to bias, and once again the use of permutation 
has been proposed to correct for this, along with dimensionality reduction to extract non-redundant 



information. For example, with the GRASS method for case-control data (Chen et al. 2010), ge- 
netic information within each gene is first summarised as 'eigenSNPs', obtained through PCA. The 
biasing effect of gene size is once again accounted for through the generation of a null distribution, 
formed by permuting phenotype labels. 

With the GL under the null, pathway selection will be influenced by pathway size (i.e. the 
number of SNPs within a pathway), since the accumulation of spurious associations in larger 
pathways will give rise to larger ||/3;||2 in ([!]). In addition, variation in dependencies between 



SNPs within pathways, and to a lesser extent between pathways will give rise to corresponding 
variations in ||/3;||2 where spurious associations arise in regions of high LD. 

One way to correct for biases arising from variations in the statistical properties of different 
pathways or groups is through the selection of appropriate group weights w = (wi, . . . , u>l) for the 
objective function ([lj. For example, as noted before, Yuan and Lin (20061 suggest one possible 
choice for the pathway weighting would be 

(13) vn = y/Si 

which ensures that groups of different size are penalised equally, and so have an equal chance of 
being selected by the model, other things being equal (see ffify). In principle, we could follow this 
strategy and perhaps attempt to account for other, additional factors that may also bias pathway 
selection. However, there are a number of problems with this approach. Consider for example the 
biasing effect of dependencies between SNPs within a pathway. Where causal SNPs tag, or reside 
within large blocks with strong LD, the pathway 'signal' will be high, increasing the chance that 
such pathways will be selected by the model, compared with other pathways where LD is low. 
This biasing effect will further depend on the distribution of LD within the pathway, which will in 
turn depend on other factors such as the number and size of pathway genes. The precise form of 



any additional term(s) that should be added to ( 13 1 to account for this bias is thus unclear. Even 



if we were able to identify a list of potential biasing factors, and formulate bias-correcting weight 
adjustments for each, we are still faced with the problem that their may be other, unknown factors 
that contribute to the bias. We therefore choose to adopt a 'hypothesis-free' approach to adjusting 
pathway weights, which makes no assumptions about those factors which might influence pathway 
selection. 

Consider pathway selection under the GL model ([!]), with A tuned to select M pathways. 
We begin with the case M = 1. When there is no selection bias, and assuming no genetic 
association, a pathway Qi should be randomly selected by the model according to a uniform 
distribution, namely with probability II; = 1/L, for I = I , . . . , L. However, when biasing factors 
are present this is generally not the case, and the empirical probability distribution describing 
pathway selection, II* (w) will not be uniform. Here the dependence upon the weight vector w 
has been made explicit, since with A tuned to select a single pathway, w alone determines the 
frequency distribution. A measure of distance between these two distributions can be obtained by 
computing their Kullback-Leibler (KL) divergence 

(14) ^ = ^nrMiog n ^ ) 

I 1 

where n ; *(u;) is the empirical probability for the selection of pathway Qi under the assumption 
of no genetic associations. When GL pathway selection is unbiased, we expect this distance to 
be approximately zero. Our strategy consists in adaptively adjusting all weights w in order to 
minimise D. 

Our adaptive weighting procedure is an iterative one, whereby at each iteration r we first 
update the previous weight vector u/ 7 " -1 ), and then re-estimate H*(w^) by fitting the GL model 
R times, each with a random permutation of the response in order to create R null data settj^J 
H*(w( T >) is then the frequency at which pathway Qi is selected across the R null data sets at 
iteration r. The algorithm is initialised at iteration r = by using an initial weight vector w^°\ 



for instance the standard size weighting (f3). This procedure is then repeated until D reaches 
some suitably small value. 



From (14), a reduction in D can be obtained by reducing the difference di = IL^w) — n;, for 
all /. As each |ef|| approaches zero, the ratio, IL*(w)/Ui, approaches one, so that the contribution 
of pathway Qi to D is decreased. With this in mind, at each iteration, we adjust pathway weights 
according to the following formula, 

(15) wl T) = w[ T ~ 1] [l-sign(dO(a- l)L 2 df] 0<a<f 

2 Alternatively, in a simulation study where the null distribution of the response is known (as in section pjjl, the 
R models can be fitted after sampling a response from that null distribution. 



where the paramater a controls the maximum amount by which each wi can be reduced in a 
single iteration, in the case that pathway Q\ is selected with zero frequency. The weighting update 
equation has the following desirable properties. When < II* < II;, i.e. — ~ < di < 0, wi is 
decreased, up to a maximum factor a when 11;* = 0, increasing the chance that group / is selected. 
When II* > Hi, i.e. di > 0, wi is increased, decreasing the chance that group I is selected. Finally, 
when II* = H, i.e. di = 0, wi is unchanged. The square in the weight adjustment factor ensures 
that large values of \di\ result in relatively large adjustments to W[. 

The estimation of II* when M > 1, that is where more than one pathway is selected by the 
model, is computationally infeasible even for a small value of M, since we would need to estimate 
the empirical joint probability distribution that M pathways are jointly selected. However, we 
expect that many of the factors biasing pathway selection when M — 1 will similarly affect this 
joint probability distribution. Under this assumption, we estimate the optimal weight vector w 
only in the M = 1 case. Extensive simulation studies (see section H) indicate that this data-driven 
adaptive waiting scheme is able to substantially increase power and specificity compared with the 
standard weighting (13), even when M > 1, indicating that this assumption holds in practice. 
Finally, we note that despite the need for multiple MC simulations over multiple iterations, our 
proposed bias-adjusted weighting strategy is fast, since it relies on fitting the GL model with A 
tuned to select a single pathway only, ensuring that the active set (see section 2.2.3 1 is very small, 
and model estimation time for each of the R model fits is minimal. 



2.4 Pathway ranking 

Penalized regression typically proceeds by determining an optimal value for A, corresponding to a 
subset of variables that best predicts the response, and this is generally done by cross validating 
the prediction error. In genetic association mapping, results are often instead presented in the 
form of lists of pathways or SNPs, ranked in order of importance. We seek such a strategy for 
the ranking of pathways within the regression model, such that pathways in C, will achieve a 
high ranking, whereas those in C will be ranked low. This approach has the added advantage of 
allowing us to make direct comparisons with alternative pathway methods that use p-values as a 
ranking criterion. 

One simple ranking criterion in penalised regression is to use the order in which each variable 
enters the model along the regularization path - i.e. as A is decreased from its maximal value, where 
no variables are selected. We instead adopt a bootstrap sampling approach, in which we fit the 
regression model over multiple subsamples of the data, drawn with replacement, at a single, fixed 
value for A. Pathways are ranked in order of importance according to their selection frequency 
across subsamples. Our motivation here is to exploit knowledge of finite sample variability obtained 
by subsampling, to achieve better estimates of pathway importance. In this respect our strategy 



resembles the pointwisc stability selection method proposed by Meinshausen and Buhlmann (20101 
in the context of variable selection. 

As with stability selection, for our ranking strategy to be effective, the value of A must be 
small enough to ensure that all pathways in C are selected by the model with a high probability at 
each subsample. Computation time increases rapidly with M, the number of selected pathways, 
so that with the number, \C\, of causal pathways unknown, the choice of M is driven by the 
number of causal pathways we seek to identify within computational constraints. We use B = 100 
subsamples, each of size N/2, and at each subsample we perform a line search over A, to ensure that 
M > M m i n pathways are selected. This procedure is described in appendix[5] Once A is tuned, for 
each subsample, b, we obtain estimates p?' (b — 1, . . . , B) for each SNP coefficient (J = 1, . . . , P*). 

For pathway Qi, we define tt^ = 1 when ||/3;^||2 ^ and 7ij = otherwise, where is the 
pathway parameter vector estimated for subsample b. We rank pathways in order of their selection 
frequency across subsamples, tt^ >,...,> tti l . We note that since typically M <C L, some 7f; may 
be zero. Such pathways are classified as unranked. 



2.5 Ranking performance measures 



In order to evaluate the success of any PGAS method, some measure of ranking performance 
is required. In this section we describe 3 separate ranking performance measures that we use 
to evaluate the performance of our method in a simulation study described in section [3j One 
complicating factor is the issue of overlapping pathways, making the effective number of causal 
pathways, \C\, dependent on the degree to which SNPs in S overlap multiple pathways. In addition, 
with any method based on variable selection, the possibility that causal pathways are unrankcd, 
i.e. they are not selected by the model, must be taken into account. 

Consider the situation where the set S of causal SNPs, with cardinality S > 1, is known. We 
may choose to define C in its most restricted sense as the set of pathways that contain all members 
of S, or alternatively C might include all pathways containing one or more SNPs belonging to iS. 
In either case \C\ will depend on the degree to which SNPs in S overlap multiple pathways. This 
in turn depends on the particular distribution of causal SNPs with respect to overlapping genes. 
The need to accommodate this variability in \C\ in part motivates our formulation of the ranking 
measures described below. 

We propose three separate ranking measures that capture different aspects of ranking perfor- 
mance, and focus on the top 100 ranked pathways only. We do this firstly because in any method 
attention is inevitably focused on the highest ranking pathways (or alternatively those with the 
highest statistical significance in a hypothesis testing framework). Also, since in a simulation study 
we compare the performance of our variable selection method which identifies a limited number 
of pathways against an alternative method that scores all pathways, some suitable cutoff in rank 
order must be chosen. 

We denote the set of ranked causal pathways by C* = {k € C : tt^ > 0}, cardinality |C*|, 
and their respective rankings by r^, rk 2 , ■ ■ ■ ,r\c*\> ranked in order of their respective selection 
frequencies, < 7Tfe 2 <,...,< tt|c*|- We further denote by C{ m = {k e C* : < 100}, 
cardinality |C 100 |, the set of ranked causal pathways falling in the top 100 ranks, with corresponding 
rankings , f fc 2 > • ■ ■ > r |cj 00 | ■ 0m three proposed ranking measures are as follows: 

1. Highest causal pathway rank, r^, that is the single highest rank achieved by any pathway 
in C 10Q . This lies in the range 1 < < 100, and is only defined for |C 10 o| > 1. 



Ranking power, pioo, defined as 

(16) Pioo 



l^ioo I 

\c\ 



with < pioo < 1- Pwo — when no causal pathways are ranked in the top 100 (C* 00 = 0), 
and Piqq = 1 when all causal pathways are ranked in the top 100 (CJ" 00 = C). 

Power-adjusted, normalised, weighted ranking score, R. This takes account of the actual 
rankings, r^, . . . )^|cj 00 |; as weu as t ne ranking power, pioo- We begin by defining a nor- 
malised, weighted ranking score, 

i 

(17) R* = k 

Here the square root increases the weight given to highly-ranked causal pathways. The 
denominator is a normalising factor which represents the minimum possible weighted ranking 
score, with = 1, Tk 2 = 2 . . . , r \c^ 00 \ = l^iool' ensuring that R* attains its minimum value 
of 1 when the pathways in C 100 are optimally ranked. Higher values of R* indicate suboptimal 
ranking. R* takes no account of the possibility that C 100 ^ C, i.e. not all causal pathways 
are ranked. To do this we form the adjusted measure 

(18) R 




R thus attains a minimum value of 1 when all causal pathways are optimally ranked, and 
the value 7 when no causal pathways are ranked. 



3 Simulation Study 

We assess the power of our proposed method in a simulation study using real genotype and pathway 
data, with simulated, quantitative phenotypes generated under an additive genetic model from 
SNPs within a single, selected causal pathway. The presence of overlapping SNPs means that 
the actual number of causal pathways is typically greater than one. We additionally compare 
our method's performance with an alternative, univariate-based method commonly used in gene 
set analysis. Computation times for both methods increase with P, and because of this, and the 
large number of scenarios and simulations tested, we restrict this analysis to SNPs on a single 
chromosome to keep execution times within practical limits. 



3.1 Genotype and pathways data 



We use genotypes obtained from the Alzheimer's Disease Neuroimaging Initiative, ADNI 
loni . ucla . edu/ ADNI ) , derived from the Illumina Human 610-Quad BeadChip. Subjects comprise 
a mix of healthy controls, those diagnosed as having mild cognitive impairment, and those with 
AD. After removing variants with a call rate < 95%, minor allele frequency (MAF) < 0.1 and 
significant deviation from Hardy- Weinberg equilibrium (p < 5.7 x 10 -7 ), 448, 294 SNPs remain. In 
this study we use genotype data from N = 743 subjects, and consider only SNPs from chromosome 
1 (33,850 SNPs). 

Popular databases used for the mapping of genes to biological pathways include the Kyoto 
Encylopedia of Genes and Genomes (KEGG, www. genome . jp/kegg/pathway .html) and BioCarta 



www. biocarta. com/genes/index . asp). For this study we use data on 'canonical pathways' from 
the Molecular Signals Database (MSigDB, www.broadinstitute.org/gsea/msigdb/index.jsp), 
which is a commonly-used, curated collection of pathways obtained from multiple sources. At the 
time of writing this comprised 880 pathways mapped to 6, 804 genes. 2,382 human gene locations 
on chromosome 1, corresponding to assembly GRCh37.p3 are obtained using Ensembl's biomart 
API (www. biomart . org). ADNI-genotyped SNPs on chromosome 1 are then mapped to annotated 
genes within lOkb (20,399 SNPs mapped to 2,096 genes), and these remaining genes and SNPs 
are then mapped to pathways using MSigDB (8,102 SNPs mapped to 778 pathways). Thus we 
see that the majority of chromosome 1 SNPs fail to map to any pathway, but that the majority 
of annotated pathways map to at least 1 SNP on this chromosome. Finally, small (< 10 SNPs) 
and identical pathways are removed. After all pre-processing we are left with a total of P — 8, 078 
SNPs mapped to 551 pathways (max: 1,059; min: 10; mean: 120 ± 142 SNPs per pathway). All 
SNP to pathway mapping and filtering was performed using bespoke code written in Python. The 
mapping and filtering process is illustrated in Fig. [2j 

More than 80% of SNPs are observed to overlap more than 1 pathway, with around 20% 
overlapping 10 or more pathways and 2% overlapping 60 or more (see Fig. [3]). After variable 



expansion to account for overlapping pathways (section 2.1 1, we have P* = 66, 085 SNPs 



3.2 Simulation framework 

We begin by adjusting the pathway weight vector, using the bias-adjusted adaptive weighting 
procedure described in section [2~3} We do this over 10 iterations with R = 40, 000 MC simulations, 
each with response y sampled from a standard normal distribution, Af(0, 1) for simplicity, since 
many quantitative traits are expected to follow a normal distribution. 

For the simulation of a SNP-dependent response, we begin by drawing S SNPs from a single, 
randomly selected causal pathway, G^>, according to some specified distribution (see below), and 
then form the set C, of causal pathways that contain all the members of S. We thus chose to 
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Figure 2: SNP to pathway mapping. 



define C in its most restricted sense, rather than for example including pathways that contain 
one or more SNPs in S. Note that the number, \C\ of causal pathways will vary according to the 
particular distribution of overlaps within S. 

For each simulation, a univariate quantitative phenotype y is simulated using an additive 
model, 

Vi = ^ ^ kXik + 6 
kes 

where Cfc is the allelic effect per minor allele due to causal SNP k. Setting Wk — Cfc^fcj we define 
the effect size of SNP k as 5k = E(wk)/E(y) for k € S, and set e ~ of) so that 5k = when 
C = 0. We also record the average SNP effect size as a proportion of total phenotypic variance, 
ESfe = Var(wfc)/Var(y), and the mean proportionate change in response per minor allele, E(Cfe). 
For our simulations we control 5k, and set accordingly, so that effect size is independent of SNP 
MAF, whereas Cfc and ESfc are MAF-dependent. 

The power and specificity of any PGAS method is likely to depend on a range of factors in- 
cluding the number of causal pathways to be identified, the number and distribution of causal 



SNPs, and the size of their phenotypic effect (Wang et al. 2010 Fridley and Biernacka 2011) 



We therefore assess the performance of our method across 6 different scenarios in which we vary 
each of these factors. Furthermore, we test each scenario over 500 MC simulations to account for 
variation in causal SNP MAFs, gene size and number within pathways, and LD patterns within 
and between causal pathways. 



100 



90 - 
80 - 




10 20 30 40 50 60 

number of pathways per SNP 



Figure 3: Frequency distribution of ADNI SNPs by number of pathways they map to. SNPs are mapped 
to genes within lOkbp. The data set consists of 8, 078 SNPs and 551 pathways. 



scenario 


S 


s k 


distribution 




description 




(a) 


10 


0.005 


random from 




S large; <5^ 


large; random distribn 


(b) 


3 


0.005 


random from 




S small; 6^ 


large; random distribn 


(c) 


3 


0.005 


random from single 


gene in 5$ 


S small; 


large; single gene 


(d) 


10 


0.001 


random from 




S large; 5 k 


small; random distribn 


(e) 


3 


0.001 


random from 




S small; 6^ 


small; random distribn 


(f) 


3 


0.001 


random from single 


gene in Q$ 


S small; <5fc 


small; single gene 



Table 1: Scenarios tested in simulation study. For scenarios (c) and (f), in the rare event that a gene has 
less than 3 SNPs, all SNPs within the gene are selected. 



The list of scenarios tested is presented in Table [T] First, we consider scenarios where the 
number of causal SNPs is small (S = 3) or large (S — 10). Secondly, we consider two different 
SNP effect sizes. We choose values for o\ and Sk to mimic effect sizes obtained in recent association 
studies, focussing particularly on the smallest reported effect sizes. Park et al. ( 2010 ) review GWAS 
for a number of quantitative traits (height, Crohn's disease and breast, prostate and colorectal 
cancers) and report values for ESfc ranging from 0.02 to 0.0004. Cho et al. (20091 report values 



for Cfc for 8 quantitative traits in a large GWAS ranging from 1.6 to 0.006. A recent neuroimaging 
genetic study measuring genetic effects on a variety of traits related to brain structure reports 
significant values for £fc of around 0.07 ( Joyner et al. 2009[ ). We set o e = 0.2, and test Sk = 0.005 
and 0.001, which gives values for ES fc = 0.001 and 0.00004 and E(( k ) = 0.01 and 0.002 respectively. 
Finally, we also vary the particular distribution of SNPs with respect to their location within causal 
pathways. We expect the distribution of causal SNPs with respect to genes and associated LD 
blocks to affect performance, both in our regression model, and in the case where pathway scores 



are derived in a two-step process that begins with the calculation of gene association scores ( Wang 



et al. 2007). The distributions of \C\, the number of causal pathways for each scenario, are shown 
in Fig!~|4j 



4 Results 

We begin with an investigation of the effect of our proposed speed ups to the GL estimation 
algorithm. We first note that GL estimation times will depend on the sample size (N) and the 
number of SNPs (P), which will in turn affect the number of mapped pathways (L) and P* . 
Estimation times will further depend on the number of groups selected (M), and the amount of 
signal present, since this affects convergence times. For illustrative purposes, in Table [2] we show 





Figure 4: Distributions of |C| across 500 MC simulations for the 6 scenarios described in Table [I] Where 
SNPs are distributed within a single gene (scenarios (c) and (f)), the number of causal pathways tends to 
be larger, since a single gene can map to multiple pathways. Where SNPs are distributed randomly across 
C?0 (scenarios (a), (b), (d), and (e)), this number tends to be smaller, particularly where the number of 
causal SNPs is large (scenarios (a) and (d)). 



gains in execution time compared with 'standard' block coordinate descent, using our proposed 
speed ups for a single model fit with a null response, and for M = 10. Estimation times are seen to 
be substantially reduced across a range of values for TV and P, dramatically so for larger datasets. 
We next turn to the application of P-GLAW to real genotype and pathway data described in 



section 2.3 We apply this procedure over 10 iterations, each with R = 40, 000 MC simulations 
with a response y ~ Af(Q, 1). Fig.[H](c) shows how the weight adjustment factor u/ T )/u/ r (see 
( 15 )), varies with di across all pathways at a single iteration. Fi g.[5| (a) and (b) shows the observed 



empirical distribution, II*, using the standard size weighting (13), and the adapted weights (151 
after 10 iterations, respectively. The corresponding KL divergence measure, D, is observed to re- 
duce steadily over the 10 iterations (Pig. [51(d)), illustrating how the proposed weight adjustment 
procedure reduces pathway selection bias. 

For the remainder of this section, we assess the performance of our proposed P-GLAW method 
using simulated phenotypes under the simulation framework described in the previous section, 
and using the bias-adjusted pathway weights described above. We first compare performance 



using the bias-adjusted weights with that obtained using the standard size weighting (13). We 
find the adjusted weighting scheme offers a considerable improvement in ranking performance 
for all ranking measures, and illustrate this in Fig. [6] for a single scenario (scenario (a)) using 
the ranking performance measures described in section |2.4| Fig. [6] (a) shows the first ranking 
measure (r^) as a ROC curve, in which we show the proportion of simulations with < z, 
for ranks z = 1,2, ... , 100. We plot z on the horizontal axis as a false positive rate (FPR), so 
that FPR = (z — l)/L. At a FPR of 0.05, we see that the adapted weighting scheme shows a 
more than 2 fold increase in power (from 0.29 to 0.62) over the standard pathway size weighting 



(13), indicating 62% of MC simulations have < 28, compared with 29% for the standard size 
weighting. The distribution of p 10 o across 500 MC simulations is illustrated as a boxplot in Fig. [6] 
(b). Here we see that the adapted weighting scheme offers a clear and substantial improvement in 
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Table 2: GL estimation times (seconds) with M = 10. Table shows the time taken for the full estimation 
with a null N ~ (0, 1) response, and with varying number of SNPs (P*), and sample size (N). 'BCD' - 
estimation using block coordinate descent only. 'BCD+' - estimation using BCD with active set, Taylor 
approximation of the group penalty and efficient computation of block residuals. Genotype and pathway 
data as described in section |3.1| P* — 4k : 5, 000 SNPs from chromosome 1 mapped to 126 pathways. 
P* — 66fc : all 33, 850 genotyped SNPs from chromosome 1 mapped to 551 pathways. P* = 647fc : 448, 294 
genome-wide SNPs mapped to 879 pathways. All computations performed using multi-threading on a 
single machine with 8 3.2 GHz processors and 64GB RAM. 




Figure 5: Application of bias-adjusted weighting procedure to the data used in the simulation study. 
R = 40, 000, with a different null response, y ~ A/"(0, 1), at each MC simulation, a — 0.98. (a) Empirical 
pathway selection frequency distribution, II*, with standard, pathway size weighting, u>i — >/Si. D = 2.24. 
Dotted horizontal line shows the expected distribution, Hi = 1/L ~ 0.002. (b) II* with bias-adjusted 
weights after 10 iterations. D = 0.12. (c) Variation of weighting adjustment factor u/ T '/ii/ T-1 ) with di 
at a single iteration, with a = 0.98. Each point represents the adjustment to a single wi, I = 1, . . . , L. (d) 
Decrease in K-L divergence, D, over 10 iterations. 



GL's capacity to rank a high proportion of causal pathways in the top 100 (p — 2.03 x 10~ 50 that 
the two population p 10 o CDFs are equal using a two-sample Kolmogorov-Smirnov (KS) test). GL 
with the standard weighting scheme performs particularly poorly with 55% of simulations failing to 
rank any causal pathway in any simulation, compared with 18% for the adapted weighting scheme. 
Finally, Fig. [6] (c) shows the distribution of the R ranking measure across 500 simulations under 
the two weighting schemes. Once again we see that the adaptive weighting scheme demonstrates 
improved ranking performance over the standard size weighting scheme, with the distribution of 
R scores skewed towards lower values for the former, indicating that causal pathways tend to be 
ranked higher. 

We next assess P-GLAW ranking performance with the adapted weighting scheme across the 
full range of scenarios, and compare these with pathway rankings obtained using the method pro- 
posed by Wang et al. (2007), commonly referred to as 'GenGen' (GG). GG is a widely-used, GSEA- 
type PGAS method that measures pathway enrichment using genes scores derived from univariate 
SNP statistics. Studies using GG include searches for implicated pathways in Crohn's disease 



(Wang et al. 2009b I, autism spectrum disorders (Wang et al. 2009a), breast cancer (Menashe 



et al. 20101 and Alzheimer's disease (Lambert et al. 2010). GG begins by scoring each SNP ac 



cording to its association with the phenotype. SNPs are then mapped to genes within a specified 
distance, and each gene is scored according to its most significant mapped SNP. The enrichment 
of highly-ranked genes in a given pathway is then compared with those in all other pathways, 
to obtain a pathway enrichment score. For GenGen we use identical source data (genotypes, 
phenotypes, SNP to gene, and gene to pathway mappings), and rank pathways by normalised 
enrichment score, determined from 1,000 permutations (the GG default settings). MC simulations 
for P-GLAW and GG are performed in parallel across 50 (P-GLAW) and 500 (GG) processors re- 
spectively, on a high-performance computing cluster. As described above for alternative weighting 
schemes, results for the comparison study are presented in the form of ROC curves (Fig. [7]), 
Pioo boxplots (Fig. [8]) and R bar graphs (Fig. [9]). Selected ranking measures are presented in 
numerical form in Tables [3] and |U 
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0.35 
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p = 2.5 x 10~ 25 
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0.44 
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0.00 
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0.30 
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0.48 


p = 7.7 x 10~ 27 
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0.59 


0.27 
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0.01 


37.33 


0.23 


0.50 


0.46 


p = 9.2 x 10 -28 


(f) 


0.79 


0.45 


1.74 


0.31 


0.14 


2.31 


0.06 


0.31 


0.20 


p = 3 x 10~ 38 



Table 3: Selected ranking performance measures for P-GLAW and GG for the 6 scenarios described in 
Table [l] ROC power, fpr = 0.05: proportion of 500 MC simulations with < 28 corresponding to a fpr 
of 0.05. median pioo: median of pioo distribution across 500 MC simulations. Proportion with pioo = 0: 
proportion of 500 MC simulations with no causal pathway in the top 100 ranks. KS 2 sample test: two- 
sample Kolmogorov-Smirnov test of the hypothesis that the P-GLAW and GG pioo population cdfs are 
the same. 



Beginning with the ROC curves illustrating the ranking measure (Fig.[7]and first 3 columns 
of Table [3]), GG consistently demonstrates increased power and specificity across all of the top 100 
ranks illustrated. In addition, the relative gain in power for P-GLAW is greater at the smallest 
effect size for each equivalent scenario, (a) vs. (d), (b) vs. (e), and (c) vs. (f). At the smaller effect 
size, where causal SNPs are distributed randomly within causal pathways, power increases where 
the number of causal SNPs is fewer ((d) vs. (e)). Finally, maximum power is achieved for both 
methods where causal SNPs are located within a single gene ((c) and (f)). 

Turning to the distributions of thepioo ranking measure (Fig.[8j and columns 4 to 9 in Table[3]), 
P-GLAW again outperforms GG across all scenarios. For example, the null hypothesis that the 
two population cdfs are equal is rejected at the a = 0.05 level (Table [3| final column), as is the 
null hypothesis that the two sample medians are the same (Fig. [8]) , except for scenario (a) where 
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Figure 6: Comparison of ranking performance: adaptive weighting scheme (section 2.3 1 vs. standard 



pathway size weighting (13 1. S = 10; 5k = 0.005; SNPs randomly distributed across (a) ROC curves 
illustrating power to identify at least one causal pathway in the top 100. Power is average across 500 
simulations, (b) Distribution of ranking power, pioo, across 500 simulations. This is the proportion 
I C | ioo I / 1 C | of causal pathways in C that arc ranked in the top 100 pathways. Notches indicate 95% 
confidence intervals for the true median, (c) Distribution of the power-adjusted, normalised, weighted 
ranking score, R, across 500 simulations. The final '50+' column includes simulations for which no causal 
pathway was ranked in the top 100, i.e. CJoo = 0; R = 100. 



median pioo is not significantly different for the two methods. Excluding scenario (a) where both 
methods perform relatively well, P-GLAW median pi 00 is consistent across each scenario, and is 
maintained from the larger to the smaller effect size. This is in marked contrast to GG, where 
this measure shows a large decrease at the smaller effect size, although the decrease is less marked 
when causal SNPs are located within a single gene. A similar pattern persists for both P-GLAW 
and GG if we consider the proportion of simulations with pioo = 0, i.e. where no causal pathways 
are found in the top 100 ranks, except for P-GLAW in the case where causal SNPs are located in 
a single gene, where this measure is particularly low. 

The final series of plots (Fig. [9]), illustrate the distributions of R across all scenarios. These 
distributions once again follow the trends in ranking performance highlighted above, but they offer 
a more nuanced view, in the sense that while this measure takes power into account, it is also 
sensitive to the actual causal pathway rankings. Here we see that P-GLAW tends to rank causal 
pathways higher than GG, since all P-GLAW distributions are skewed towards lower R values, 
indicating that causal pathways tend to be ranked higher. This is borne out if we focus on the 
proportion of simulations with R < 10 (Table |4j first 3 columns), which also illustrates how pro- 
portionate gains in ranking performance for P-GLAW over GG are largest for the smallest effect 
size ((a)-(c) vs. (d)-(f)). This table also gives results for the proportion of simulations demon- 
strating near optimal ranking of causal pathways (R < 3), although the very small frequencies 
suggest that little can be inferred from these. 



scenario 


R < 10 






R < 3 






P-GLAW 


GG 


ratio 


P-GLAW 


GG 


ratio 


(a) 


0.68 


0.46 


1.47 


0.13 


0.09 


1.38 


(b) 


0.50 


0.24 


2.11 


0.03 


0.03 


0.93 


(c) 


0.55 


0.33 


1.68 


0.01 


0.07 


0.18 


(d) 


0.44 


0.20 


2.22 


0.03 


0.02 


2.00 


(e) 


0.46 


0.20 


2.33 


0.02 


0.03 


0.69 


(f) 


0.45 


0.23 


1.96 


0.01 


0.04 


0.30 



Table 4: Proportion of 500 simulations with R < 10 and R < 3 for the 6 scenarios described in Table 1 



5 Discussion 

We have developed a penalised regression-based strategy (P-GLAW) that exploits functional struc- 
ture within genotypes to identify biological pathways associated with a continuous trait. We use 
the group lasso, with all mapped SNPs and pathways in a single regression model, and use a novel 
combination of methods including a bias-adjusted group weighting scheme and bootstrap sam- 
pling, together with a number of speed ups designed to make the analysis of large scale datasets 
computationally feasible. An important feature of our method is the need to accommodate the 
presence of overlapping pathways. On the assumption that causal SNPs are enriched within a 
biological pathway, we find in a simulation study that our proposed method shows relative gains 
in both power and specificity across a range of scenarios, compared with an alternative pathways 
method (GG), based on univariate SNP statistics, that we use as a benchmark. We believe this is 
the first such study evaluating GL performance using real SNP and pathway data across a range 
of realistic scenarios. 

One key motivation for a pathways-based approach is the desire to harness the joint effects of 
those SNPs or genes with relatively small effect size, that typically fail to achieve genome-wide 



significance in GWAS (Baranzini et al. 20091. We hypothesise that the advantages inherent in 
a multivariate approach to modelling SNP effects will increase power to detect these, and in our 
simulation study we therefore focus on scenarios with causal SNPs that exhibit effect sizes at or 
below the limits of those found in recent GWAS. To evaluate the performance of each method 
considered here, we devise three separate ranking metrics, each of which measures a different 
aspect of ranking performance. 
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Figure 7: ROC curves illustrating proportion of simulations with r kl < z, for ranks z = 1,2, ...,100. 
Power is average across 500 simulations. False positive rate — (z — 1)/L. Scenarios corresponding to the 
higher SNP effect size (8k = 0.005) are presented in the left-hand column, with the equivalent scenarios 
at the lower effect size (Sk = 0.001) on the right. 
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Figure 8; Box plots of distribution of ranking power, pioo, across 500 simulations. This is the proportion 
|C| 10 o/|C| of causal pathways in C that are ranked in the top 100 pathways. Notches indicate 95% confidence 
intervals for the true median. 
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Figure 9: Distribution of the power-adjusted, normalised, weighted ranking score, R, across 500 simula- 
tions. The final '50+' column includes simulations for which no causal pathway was ranked in the top 
100, i.e. Cloo =%;R= 100. 



One factor affecting power is the 'genetic architecture' of the disease in question, that is the 
number and distribution of SNP effects across causal pathways (Wang et al. 2010). For example, 
causal SNPs may be distributed across many genes in a pathway, or restricted to a single gene. 
Since PGAS methods vary in the way that they combine the effects of individual SNPs, the specific 



genetic architecture is expected to impact power for different methods in different ways (Wang 



et al. 2009b Holmans et al. 20091. GG uses genes scores corresponding to the most significant 



SNP associated with a gene to establish pathway significance. This has the advantage of reducing 
redundant information arising from SNPs in LD with a causal SNP within a single gene, but may 



lead to reduced power where causal variants reside in distinct LD blocks within a gene (Wang 



et al. 2007). An important, related factor that we find has received little attention is the issue of 



overlapping pathways, and the consequent effect on PGAS performance. The precise distribution 
of causal SNPs with respect to genes that overlap multiple pathways will affect the number of 
pathways that are considered to be 'causal', and we expect this to affect ranking performance for 
different methods in different ways. To explore these issues, we investigate a variety of different 
genetic architectures, in which we vary both the number and distribution of causal SNPs with 
respect to pathways and genes. 

In general, we find that P-GLAW performs well across the range of causal SNP distributions and 
effect sizes considered. Additionally, our method is able to consistently outperform the benchmark 
(GG). GG performance at the smaller effect size is particularly weak, so that P-GLAW shows the 
largest gains in relative performance here. 

An insight into some of those factors affecting ranking performance is afforded by considering 
some of the ranking measures in more detail. Starting with the highest ranking causal pathway 
measure (r^), as expected we find that this decreases for each scenario at the smaller effect size. 
However, at the smaller effect size this measure is observed to increase for both methods as the 
number of causal SNPs is decreased, markedly so when the reduced number of causal SNPs are 
concentrated in a single gene. Since the effect size for each causal SNP is held constant, this 
seems counterintuitive, since the pathway 'signal' is reduced when there are fewer causal SNPs. In 
addition, for the reasons described above, for GG this signal may be further reduced where causal 
SNPs reside within a single gene. The explanation is likely to be that the effective number of causal 
pathways tends to increase as the number of causal SNPs is reduced, increasing the probability 
that a single causal pathway is ranked high. The number of causal pathways is even larger when 
causal SNPs are concentrated in a single gene (see Fig. [1]) . Where the pathway signal is highest 
(scenario (a)), both methods tend to rank a high proportion of causal pathways in the top 100 
(high pioo), although the proportion of MC simulations in which GG fails to rank any causal 
pathways (that is the proportion of simulations with pioo = 0) is relatively high. On this measure 
of ranking power, GG performs relatively poorly across all other scenarios, particularly at the 
smaller effect size. Interestingly, P-GLAW is relatively insensitive to variation in the number and 
distribution of SNPs within causal pathways, as might be expected from the smoothing properties 
of the GL £2 penalty, which ensures that all SNPs within a selected pathway are retained in the 



model ( Zhou et al. 2010 ) 



The need to account for factors such as variation in LD, gene and pathway size is a feature 
common to all PGAS methods. A range of approaches, often used in combination, have been 
proposed to correct for these biasing factors, including the use of gene scores that summarise SNP 
statistics (Holmans et al. 20091, and permutation of phenotypes (Wang et al. 2009b). Dimen- 



sionality reduction techniques have also been advocated for the control of redundant information 



Chen et al. 


2010 


Zhu and Li 


2011 


Ballard et al. 


2010 



2010 ). For P-GLAW, we propose a method that 



adjusts the distribution of pathway weights according to the observed bias in pathway selection 
frequencies across multiple MC simulations under the null. We find in a simulation study that our 
proposed bias correction method does substantially increase power and specificity, indicating that 
pathway selection bias is decreased. One potential disadvantage of our approach is that it takes 
no account of variation in biasing factors within a pathway. It would be interesting to compare 
the relative merits of our approach against alternative bias-reduction methods, for example the 
use of within-pathway dimensionality reduction. However, we consider the retention of all SNPs 
in the regression model to be a potentially attractive feature of our approach, as it affords the 



possibility of the simultaneous identification of causal SNPs driving pathway selection, and we are 
currently pursuing this as an extension to the present model. 

In situations where predictors, or groups of predictors are correlated, both the lasso and group 
lasso can demonstrate problems with consistency, that is they are unable to constently identify 



the true set of causal predictors or groups ( Zhao and Yu 2006 Bach 2008 Chatterjee and Lahiri 



2011 ). Despite this, we have demonstrated that in a finite sample, our bootstrap sampling approach 
performs well, and this has been borne out elsewhere (Meinshausen and Buhlrnann 2010). We 



are however pursuing alternative methods for the ranking of pathways, using different ranking 
strategies. 

We pay considerable attention to the need to develop fast algorithms for solving the GL, 
a problem that is particularly acute when using regression models with GWAS data. Using a 
combination of techniques, we establish a GL estimation algorithm that can quickly solve the GL 
using whole genome data. However, the very large number of simulations and scenarios considered 
in our simulation study, and the relatively slow performance of the benchmark method mean that 
we restrict the analysis to mapped SNPs from a single chromosome^ 

We note that phenotypes in our simulation study are generated under an additive linear model. 
The assumption of additive linear SNP effects is built into both the P-GLAW and GG models, in 
the former through the SNP allele codings in the genotype design matrix, and in the latter through 
the particular model used to generate the univariate SNP scores, although for both methods 
alternative models can easily be accommodated. 

In our simulation study we account for variation in the size and distribution of causal SNP 
minor allele frequencies through the use of MC simulations, but we expect that such variation is 
likely to impact model performance, and this is something that warrants further exploration. 

As with all PGAS methods, we note that results are dependent on the choice of pathways 
database, and will inevitably reflect biases due for example to the increased number of annotations 



for genes implicated in particular disease etiologies (Elbers et al. 2009 Cantor et al. 2010). 



Results are also subject to bias resulting from SNP to gene mapping strategies. For example, SNP 
to gene mapping distances will affect the number of unmapped SNPs falling within gene 'deserts' 



( Elcfthcrohorinou et al. 2009), SNPs will map to relatively large numbers of genes in gene rich 



areas of the genome, and the mapping of a SNP to its closest gene may obscure a true functional 



relationships with a more distant gene (Wang et al. 2009b). 



Finally, we note that our method can be easily adapted to accommodate other ways of grouping 



SNP data, for exampling using protein interaction networks (Wu et al. 2010), or GO and other 



ontologies (Jensen and Bork 2010). 



Appendix Line search over A 

We wish to tune A so as to select a minimum M pathways at each subsample. To do this we 
perform a line search over A. This procedure is described in box 3. 

Box 3 Line search procedure for tuning A to select M > M min pathways 



1. Set X max = min A : ||Xfy|| 2 < Xwi (from ^) and a = 0.8+ 

2. Let A = a\ max 

3. Form the active set, A — {m G Q : ||X^y|| 2 < \w m } 

4. Let M = \A\. If M < M min skip to step 6.+ 

5. Solve the GL estimation at A using the active set A, as described in box [2] (starting at box 
[2] step 2.) 



Python code for mapping SNPs to pathways, and for analysing SNP data using PGLAW is available on request. 



Let the solution be (3, with final active set A 

<S(A) = {I e Q : \\$i\\ > 0} (the set of selected pathways) 

M = \S(\)\ (the number of selected pathways) 



6. if M > M min 

[3 is the full solution 
STOP 

else 

\nax = A (need to decrease A) 

end 

7. Go to step 2. 

* The value of a is chosen for computational convenience. A value close to 1 ensures that A values stay 
close to 1, so that as few pathways are selected by the model as possible, thus speeding up the estimation. 
However, a value too close to 1 means that the decrease in A at each iteration is small, meaning that many 
iterations may have to be performed before M reaches the desired range. 

* This step is introduced for computational efficiency, since if _4| < M min there is no prospect of selecting 
enough groups 
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